ABSTRACT 

It is increasingly recognized, following the lead of 
J. J. Cannell, that actual gains in educational achievement may be 
much more modest than dramatic gains reported by many state 
assessments and many test publishers. An overview is presented of 
explanations of spurious test score gains. Focus is on determining 
how test-curriculum alignment and teaching the test influence the 
meaning of scores. Findings of a survey of state testing directors 
are summarized, and the question of teaching the test is examined. 
Some frequently presented explanations refer to norms used; others 
refer to aspects of teaching the test. Directors of testing from 46 
states (four states conduct no state testing) replied to a survey 
about testing. Forty states clearly had high-stakes testing. The most 
pervasive source of high-stakes pressure identified by respondents 
was media coverage. Responses indicate that test-curriculum alignment 
and teaching the test are distorting instruction. A possible solution 
is to develop new tests every year, changing the tests rather than 
the norms. Two tables present explanations for test score inflation 
and selected survey responses. (SLD) 
Cannell's 1987 report attacked the credibility and 
integrity of nationally normed standardized achievement test 
results. According to his survey, all 50 states claim to be 
above the national average, and an estimated 70 percent of 
students nationally are told they are performing above 
average. Cannell found these results illogical and 
inconsistent with other indicators of educational quality. 
Although he had heard the counterexplanation that "high 
scores reflect improved achievement levels," he argued that 
inaccurate initial norms and teaching the test were more 
likely causes of high scores. 

Responses to Cannell from educational policymakers and 
from test publishers were of three types: 1) his data are 
wrong (or he doesn't understand statistics); 2) his negative 
inferences are wrong, high achievement scores are real; or 3) 
he's right, test scores are very likely inflated by factors 
such as outdated norms and too much familiarity with the 
tests. Bob Linn's paper addressed the validity of the first 
two rebuttals. Linn et al.'s (1989) analysis provides more 
exhaustive consideration of subject areas and grade levels, 
more statistically defensible treatment of reported test 
score distributions, and a more representative sample of 
school district data. 

Nonetheless, Linn confirms Cannell's basic conclusion. 
Considering reported results from the 35 states with 
nationally normative comparisons, "the overall percent of 
students above the national median is greater than 50 in all 
of the elementary grades in both reading and mathematics for 
each of the three years studied" (Linn et al., 1989, p. 8). 
Use of the median rather than mean precludes esoteric 
discussion about skewed distributions. By definition, 50 
percent of students should be on each side of the median. 
Thus only two contentions are possible: Either achievement 
has gone up since the base year or something is amiss. 
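
The definitional point can be checked with a short simulation. The sketch below is illustrative only: the scores are hypothetical draws from a skewed distribution, not data from Linn et al. (1989) or from any norming study, and all variable names are assumptions. It shows that about half of a sample falls above its own median however skewed the distribution is, while the share above the mean can differ markedly.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical, strongly right-skewed "scores" for a norming sample.
# These are simulated values, not real test data.
scores = [random.lognormvariate(0.0, 1.0) for _ in range(10_001)]

median = statistics.median(scores)
mean = statistics.fmean(scores)

# Share of the sample scoring strictly above each reference point.
share_above_median = sum(s > median for s in scores) / len(scores)
share_above_mean = sum(s > mean for s in scores) / len(scores)

print(f"share above median: {share_above_median:.3f}")
print(f"share above mean:   {share_above_mean:.3f}")
```

Under these assumptions the share above the median stays at one half while the share above the mean does not, which is why a report of well over 50 percent of students above the national median implies either real gains since the norming year or a problem with the comparison.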

Linn also presents data that contradict the claim that 
all of the apparent gains are real. National trend data 
provided by the National Assessment of Educational Progress 
(NAEP) document gains in achievement that are much more 
modest than the dramatic gains reported by many state 
assessments and by publishers for their normative samples. 
Koretz's (1986, 1987) Congressional Budget Office studies of 
several large-scale databases likewise confirm that 
achievement is improving nationally, but Koretz (1988) 
concluded that the gains reported by standardized tests are 
exaggerated. 

These comparisons to other more credible national data 
both support and contradict Cannell's claims. Yes, achievement 
gains reported to the public based on standardized achievement 
tests appear to be exaggerated. But it is also apparent that 
the norms themselves may be inflated; the steep gains from 
1970's to 1980's norm groups could be caused in part by 
oversubscription of prior users in the recent norm groups (see 
Table 1, Phillips & Finn, 1988) and similarities between old 
and new forms of the tests. This would mean, contrary to 
Cannell's accusation of collusion and misrepresentation by 
publishers to make schools look good, that the revised norms 
could actually have set too high a standard of comparison in 
the base year. Furthermore, if up-to-date norms carried 
forward this intrinsic response bias, they would continue to be 
too high, not too low. This is an extremely important point 
because it bears on the validity of alternative solutions to 
the problems raised by Cannell. 

If outdated norms are seen as the central problem, then 
annual norms are the answer. Indeed, annual norms and 
educating the public about the "time-bound nature of norms" 
(Williams, 1988) have been the primary responses by test 
publishers and state testing directors in our survey. If, 
however, the problem is spuriously high test scores because 
of too much teaching the test in the face of too much 
accountability pressure, then annual norms will contribute to 
the problem by creating a standard that is more and more 
unattainable by legitimate teaching methods. This tension or 
dilemma is the focus of the paper, as reflected in the title, 
"Inflated test score gains: Is it old norms or teaching the 
test?" 

In this paper I present: 1) an overview of the 
explanations given for spurious test score gains and 2) an 
encapsulated summary of findings from our survey of state 
testing directors regarding the narrowing of curriculum and 
teaching the test. Then I return to the dilemma posed by the 
effect of teaching the test on the norms themselves and 
consider what solutions should be pursued if test familiarity 
is seen as the primary problem, rather than outdated norms. 

Explanations for Spurious Test Score Gains 

Cannell first released his report in November of 1987; 
the Summer 1988 issue of Educational Measurement: Issues and 
Practice was a special issue devoted to Cannell's findings 
with commentaries by researchers representing the U.S. 
Department of Education and each of the major test publishing 
firms. Table 1 provides a summary of the elements in those 
responses specifically addressing the possible explanations 
for inflated scores. Explanations 1 and 4 pertain to norms. 
Explanations 2, 3, and 5 refer to aspects of teaching the 
test.¹ Note that Drahozal and Frisbie (1988) and Stonehill 
(1988) speculate about the type of bias that would have to 
occur for non-representative norms to lead to an 
overestimate of student achievement. In contrast, Phillips and 
Finn (1988) and Lenke and Keene (1988) consider the more 
realistic possibility that normative samples become biased by 
the greater participation rates of user districts, thus 
leading to spuriously high norms and an underestimate of 
achievement for naive test takers. Lenke and Keene provide 
direct evidence that user norms are inflated but do not 
themselves apply these findings to argue against the validity 
of annual norms (which publishers would most likely construct 
from user data). Instead they argue against annual norms as 
a "moving target" that would preclude evaluation of change 
over time. Three respondents suggest annual user-norms as a 
corrective to outdated norms. Three respondents suggest 
fresh tests or test security as the remedy for teaching the 
test. Only one author compares the two problems and their 
respective solutions. Anticipating the theme of this paper, 
Qualls-Payne (1988) commented that, "If new forms of 
achievement tests are developed each year, thereby increasing 
test security, the need for annual norms diminishes 
significantly" (p. 22). 

¹ Our project to replicate and extend Cannell's study, which Linn (1989) 
described in his paper, was also designed to address explanation 6, the 
extent to which handicapped and limited English speaking students might 
be included more often in the norming sample than in state and district 
accountability testing programs. These results will be discussed in the 
technical report available in the summer of 1989. 

Various Meanings of Teaching the Test 

The phrase, teaching the test, is evocative but in fact 
has too many meanings to be directly useful. Although it has 
a negative connotation to most members of the public, many 
educators take it to mean teaching to the domain of knowledge 
represented by the test. In framing our interview questions 
with state testing directors or their representatives, we 
avoided the pejorative phrase with multiple interpretations. 
Instead, we asked about a wide range of policies and 
practices, beginning with the uses for the test data, the 
process of test selection, time spent on teaching the test 
objectives, and test preparation efforts. 






High-stakes Environment 

It is commonly understood that one of the salient 
characteristics of the educational reform movement of the 
1980's has been high-stakes testing. Popham (1987) used the 
term, "high stakes," to refer to both tests with severe 
consequences for individual pupils, such as non-promotion, 
and those used to rank schools and districts in the media. 
The latter characterization clearly applies to 40 of the 50 
states. Only four states conduct no state testing nor 
aggregation of local district results (Montana, Nebraska, 
Ohio, and Vermont); two states, Oregon and Wyoming, collect 
state data on a sampling basis in a way that does not put the 
spotlight on local districts. Wisconsin and North Dakota 
report state results collected from districts on a voluntary 
basis. Two additional states were rated as relatively 
low-stakes by their test coordinators²; in these states, for 
example, test results are not typically page-one news nor are 
district rank-orderings published. The testing directors in 
the 10 states without high-stakes state tests were careful to 
point out that their comments did not necessarily apply to 
individual districts within their state where public 
attention to test scores might be extremely intense. 



² Our interviews were conducted under the agreement with respondents 
that states would not be anonymous when citing matters of fact regarding 
the testing program or policies that could be quoted directly from 
published materials. However, when respondents were asked to state an 
opinion or perception, they, and hence their states, would be anonymous. 
The most pervasive source of high-stakes pressure 
identified by respondents was media coverage. Presentation 
of test results to the state board is a media event. Each 
local paper then runs its own story on the health of public 
education and ranks the districts within its jurisdiction. 
Other uses of test results that would give them extremely 
high importance where they occur were reportedly infrequent. 
For example, many had heard talk of superintendents or 
principals who were fired because they had been unable to 
raise test scores satisfactorily. Though the talk was 
widespread, contributing in many cases to reported principal 
anxiety, the instances were rare and unverified in all but a 
few cases. Only a few states have financial incentive 
programs where there is some financial reward to schools, 
districts, or teachers which derive from raised test scores. 
Another small number of states have placed districts in 
receivership, based on low test scores among other factors. 
Although none of the states have the coincidence of all of 
these high-stakes pressures, intense media coverage and 
scrutiny from the legislature alone were sufficient for many 
to rate test score results as "very important" or "extremely 
important." These other factors appear to contribute to the 
sense of urgency or pressure associated with test scores even 
if they directly affect only a small portion of educators in 
the state. 
High stakes do not necessarily mean invalid test scores, 
but they clearly alter the context of testing as suggested by 
Phillips and Finn (1988). Furthermore, intense pressure on 
educators to improve scores sets the stage and increases the 
incentives for the various types of "teaching the test" 
efforts discussed in the following sections. 



Test-curriculum Alignment 

Without question, published norm-referenced tests are 
selected to achieve the best match possible between the test 
content and the state's curriculum. The following interview 
segment typifies the process described by state directors in 
response to the question, "Who selected the standardized test 
being used?" 



Committees of teachers are set up by grade level so 
that a group of third grade teachers would be 
reviewing tests appropriate to the third grade, and 
then would begin making recommendations as to which 
test is better in content, format, technical 
characteristics and so forth. We also add to that 
list of teachers groups of technical specialists who 
look at things like norms and so forth, adequacy of 
reporting and scaling and so forth. We also add 
another committee comprised mostly of persons that 
would be curriculum specialists in the central office 
level. And these three groups make independent 
recommendations. HAVE THERE BEEN EFFORTS TO ASSURE 
THAT THE CURRICULUM AND THE TEST ARE ALIGNED? 
Absolutely. That's what each of these three groups 
does. The teachers look at a lot of things like 
formatting and carefulness of construction and look 
for item bias, those kinds of things. But the key 
thing that they look for is alignment with curriculum. 
If the test is not aligned with our curriculum, it 
just gets discarded immediately. 
The few states with customized tests or home-made tests 
linked to national norms, of course, are able to achieve even 
closer alignment between the state's curriculum outline and 
test content because they are not constrained to select from 
existing tests. 

It is also evident that test-curriculum alignment is a 
reciprocal process. That is, once the test is chosen that 
best fits the curriculum, the practiced curriculum is 
adjusted further in response to the test. Many directors 
emphasized that this was, in fact, the conscious purpose of 
the testing program, to ensure that essential skills are 
taught. Item analysis data are usually provided and 
districts are encouraged to look for areas of weaknesses that 
require greater instructional effort. Counterexamples were 
extremely rare; for example, we were told by one respondent 
that districts are told not to worry about subpart scores 
where they do poorly if that element is not emphasized in 
their local curriculum or is taught at a later time. 

When asked, "Do you think that teachers spend more time 
teaching the specific objectives on the test(s) than they 
would if the tests were not required?", the answer from the 
40 high-stakes states was nearly unanimously yes. The 
majority of respondents went on to describe the positive 
aspects of this more focused instruction. "Surely there is 
some influence of the content of the test on instruction. 
That's the intentional and good part of the testing 
probably." And in another state, "I can only tell you that 
the people I've talked to, and it is certainly not a 
representative sample, have indicated that in fact the 
presence of the test is forcing attention to the essential 
skills that had been identified." Other respondents 
(representing about one-third of the high-stakes states) also 
said that teachers were spending more time teaching the 
specific objectives on the test but cast their answer in a 
negative way: "Yes. There is some definite evidence to that 
effect. I don't know that I should even say very much about 
that. There are some real potential problems 
there... Basically the tests do drive the curriculum." 

The follow-up question, "To what extent do you think 
important objectives are given less time or emphasis because 
they are not included in the test?", elicited a less uniform 
response, but answers were consistent with the positive or 
negative valence of the preceding answer. For example, those 
who believed that focusing instruction was a positive effect 
of testing gave answers such as the following: 

"Yes, that happens but in a minority of our schools." 

"They would teach the essential competencies even 
without the (norm referenced test)." 

"Until the students master the basic skills their 
experiences in other areas are limited or 
non-functioning anyway." 

"There's some tendency to narrow but the community 
keeps the pressure on for gifted education." 

Those who expressed more concern about the narrowing of 
instruction gave answers such as: 

The answer is yes, but I have no idea. I'm not close 
enough to any data that would give me a clue on 
percentage. I certainly feel comfortable saying yes, 
that I think there has been a decreased emphasis. 
WHAT KINDS OF THINGS ARE OMITTED OR DEEMPHASIZED? I 
think it occurs two ways. One, within the subject, 
some of the higher level objectives suffer. That is, 
other than reading and math. 

Test selection to match curriculum and subsequent 
shaping of curriculum to conform to the test are not regarded 
as illegitimate practices. For decades, it has been standard 
advice in measurement textbooks to select standardized tests 
on the basis of technical adequacy and congruence with 
local curricula. Aligning curriculum to follow the test 
can be defended in the general spirit of teaching to 
agreed-upon goals; whether particular instances of this practice are 
defensible depends on the breadth of the test content and how 
extensively the tested objectives take over instruction. 
Although measurement-driven instruction may not be desirable 
if one rejects an assembly-line conception of learning 
(Bracey, 1987; Shepard, 1988), it is not patently unethical 
to teach to the test objectives. 

In one sense it can be said that test-curriculum 
alignment does not lead to spurious test score gains. 
Students can be said to have learned more of the specified 
objectives. Narrowing of curriculum does, however, alter the 
meaning of normative comparisons. The original 
standardization sample did not have the benefit of such 
focused instruction. Students in the norming sample were 
apparently learning the tested content and other things as 
well when they took the unannounced test. One way of 
thinking about the change in the meaning of norms is to 
recall the old anchor study where national probability 
samples of students were used to equate all of the 
standardized tests to each other (Jaeger, 1973). When all 
tests are administered in naive conditions, i.e., where 
curricula have not been aligned, then the equating answers 
the question, "How would students who performed at percentile 
X on test A, do on test B?" As soon as schools begin to 
tailor instruction to a particular test, these equivalences 
no longer hold. As far as the public meaning of test scores 
is concerned, however, there is an implicit assumption made 
that these equivalences hold true. For example, if the 
average student in your local district were scoring at the 
60th percentile on the CAT, you would want to be able to 
assume that the district's performance would be roughly the 
same on the ITBS. More to the point, consider the political 
ethos associated with educational reform. When politicians 
learn that U.S. students do poorly on international 
achievement comparisons and install testing programs, they 
wish to assume that rising local scores are evidence that the 
achievement deficit has been remedied. But once curriculum 
has been aligned to the local test, there is no guarantee 
that apparent gains generalize to non-taught-to tests. 

Note that the provision of annual user norms moves in 
the opposite direction of the anchor study. The notion of 
naive test takers is abandoned and each test then develops 
its own population of users. A local district that used a 
test but maintained a broad curricular focus beyond the test 
domain would be at a disadvantage in such comparisons. 

Test Preparation 

Our questions about test preparation were intended to 
encompass a range of activities including content review, 
advice about test-wiseness skills, practice with unfamiliar 
formats, as well as more questionable practices that Phillips 
and Finn (1988) had in mind when they referred to teaching 
the test as distinct from teaching test objectives. 
Respondents' descriptions of typical test preparation 
practices most often began with advice to students to "get a 
good night's sleep" the night before the test. Next most 
frequent was the response that districts use the standard 
materials provided by test publishers. In particular, children 
in grades 1, 2, 3, and sometimes 4 are administered a formal 
practice test to acquaint them with test format demands. 
These materials are provided by the publishers and, unlike 
many practices that depart from the conditions of the 
standardization study, were a part of the normative test 
administration. 

Most states do not provide materials for test 
preparation (beyond those available from the publisher) nor 
do they provide guidelines as to what constitutes appropriate 
test preparation. Several of the states with state-developed 
criterion-referenced tests distribute detailed item 
specifications to encourage teaching to the test objectives 
and distribute old forms of the state test to be used for 
student practice and remediation. Respondents at the state 
level were generally unaware of the extent to which local 
schools and districts engaged in content review or provided 
additional format practice for their students. 

When asked what they had observed as extreme instances 
of test preparation, responses included: 

"Some districts have picked up on Scoring High, which 
is not covered under our test security rules." 

"Once in a great while we find that people are using 
materials identical to our test." 

"Some districts have developed their own practice tests 
and have a time line for covering each objective." 

"They have courses designed to prepare for the (high 
school) tests." 

"Pep rallies (are held) prior to test week to psych 
kids up to do well on the test." 

One-time practice with test format, especially when 
these activities are consistent with standardization 
procedures, is not the cause of inflated test scores. 
However, repeated practice or instruction geared to the 
format of the test rather than the content domain can 
increase scores without increasing achievement. For example, 
Mehrens and Kaminski (1988) conducted a content analysis of 
the Scoring High test preparation materials published by 
Random House. They concluded that the materials were so 
similar to the test that practice with the Scoring High (CAT) 
is equivalent to giving the parallel form of the test as a 
practice test and explaining all the answer choices to the 
students. Although the latter would be clearly unethical, 
many educators purchase Scoring High without confronting any 
ethical issues because it is sold as instructional or review 
material. 

Test Security and Test Familiarity 

In two states security measures associated with the 
norm-referenced testing program were described by the 
state-level directors as lax; specifically, local schools were 
allowed to store testing materials in the school from one 
year to the next. These were the exceptions, however. The 
great majority of states described extensive security 
procedures intended to limit the exposure of test materials 
in the schools and to keep account of test booklets. The 
following excerpt from the Rhode Island Testing Coordinator's 
Handbook (198P) is an example of typical security 
precautions: 

1. Store materials in rooms or cabinets that are locked, 
and that are not readily accessible to large numbers 
of other people. 

2. Check all materials as you receive them to verify 
counts; have counts verified again when materials are 
returned for storage. 

3. Keep all extra test materials in the secure location 
when they are not in use (p. 8). 
We also asked state directors about their experiences 
with cheating and about procedures they used to detect 
anomalous results. Cheating has been exposed in several 
states but it is generally believed to occur in a very tiny 
percentage of schools (1-3%). Only California has in place a 
computer-scanning procedure to detect significant numbers of 
erasures signaling that teachers might have redone answer 
sheets after they were turned in by students. Using this 
procedure, the California Assessment Program announced last 
September the names of 40 elementary schools (among 5000) 
suspected of cheating on their 1985-86 tests (Woo, 1988). A 
number of states use computer-assisted or informal means to 
check for extraordinary gains from one year to the next and 
then inspect the materials to see if there is any evidence of 
tampering. The great majority of states do not have 
procedures to detect anomalous results. On rare occasions 
they receive phone calls where a parent or educator in a 
neighboring district complains about practices such as 
giving a dittoed version of the test the day before for 
practice or helping students during the test administration. 
Follow-up investigation may be handled by the state or the 
district; most often the test is readministered when an 
invalid administration occurs. 

Although test materials are kept under lock and key and 
reported instances of cheating are rare, typical 
norm-referenced testing practices do not conform to the type of 
rigid security associated with programs such as college 
admissions testing. With some help from counselors, 
standardized tests are usually administered by classroom 
teachers. The same form of the test is administered year 
after year. Table 2 provides a summary of both 
norm-referenced and criterion-referenced testing programs with an 
indication as to how long the identical test has been used. 
Given that publishers follow a cycle of test revision every 7 
or 8 years, it is not surprising that a few of the testing 
programs have had their tests in place for 6 or 7 years. 

We speculated that test familiarity might allow teachers 
to improve the performance of their students innocently 
without consciously deciding to cheat by xeroxing a copy of 
the test. For example, suppose you couldn't help but 
remember several of the vocabulary items on the test or you 
chose to do a science unit on one of the animals discussed in 
a text reading passage. Perhaps you were distressed during 
the test last year when your third graders were asked to do 
money problems in a format they had never seen before, so you 
decided to use examples of that format from now on. To 
assess how much impact test familiarity could potentially 
have on scores, we used published norm tables for grades 3 
and 6 for two of the most prevalently used tests, the 
Stanford Achievement Test (SAT) and the California 
Achievement Tests (CAT), and looked up the conversions of 
number right scores to percentile ranks. At the median in 
reading, language, and mathematics, one additional item 
correct translates into a percentile gain of 2 to 7 points. 
This means that teachers could relatively innocently teach to 
just a few items and raise achievement scores by several 
percentile points. For example, on the CAT, Form E, 
Vocabulary constitutes half of the total Reading score. At 
third grade, someone at the 49th percentile will increase to 
the 54th percentile for one more item correct on the 
Vocabulary subtest. Suppose that half of the class already 
knows the vocabulary words the teacher has remembered (or 
would know them in the ordinary course of instruction), then 
the teacher only has to be sure that the rest of the class 
learns two vocabulary items to increase the class standing on 
the Vocabulary subtest by five percentile points. 
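
The size of such gains can be roughed out without the norm tables. The following sketch is not the procedure used above, which read conversions directly from the published SAT and CAT tables. It assumes instead that number-right scores are approximately normally distributed, and the raw-score standard deviations are hypothetical values chosen for illustration. Under that assumption, a student at the median who answers one more item correctly gains roughly 100 x (Phi(1/SD) - 0.5) percentile points, where Phi is the standard normal distribution function.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percentile_gain_at_median(raw_sd, extra_items=1):
    """Percentile points gained by a student at the median who answers
    `extra_items` more items correctly, assuming normally distributed
    number-right scores with standard deviation `raw_sd` (in items)."""
    return 100.0 * (normal_cdf(extra_items / raw_sd) - 0.5)


# Hypothetical raw-score standard deviations, NOT taken from any
# published norm table.
for raw_sd in (6, 8, 12, 20):
    gain = percentile_gain_at_median(raw_sd)
    print(f"raw SD = {raw_sd:2d} items -> +{gain:.1f} percentile points")
```

With these hypothetical standard deviations, the gain per added item falls roughly within the 2-to-7-point range noted above; the actual figures depend on each test's published conversions.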

The old complaint against norm-referenced tests used to 
be that they are insensitive to instruction. They were 
constructed to represent relatively broad content domains; 
items were thought of as samples from this broad domain. It 
would take an enormous amount of instruction aimed at the 
full domain to move the class average by a single item. Our 
examples from the norms tables illustrate, however, that 
teaching to specific items is enormously more efficient. In 
this sense, norm-referenced tests are quite sensitive or 
vulnerable to teaching to specific items. 

The Solution Should Be Fresh Tests, Not Annual Norms 

Interview data cannot support a calculation to sort out 
precisely how much of apparent test score gains are real and 
how much are spurious. Our data do suggest that the 
conditions for inflated results exist, in some cases to a 
marked degree. Forty of the 50 states administer high-stakes 
state testing programs which place some amount of pressure on 
teachers, principals, and superintendents to raise scores. 
There is substantial documentation of test-curriculum 
alignment. Practices which can be described as teaching the 
test rather than the test objectives exist in every state to 
some unknown degree. Each of these factors will affect the 
validity of local scores and will also distort the meaning of 
annual user norms. 

When Phillips and Finn (1988) discussed annual user 
norms as a solution to outdated norms, they were very clear 
that these norms should be representative of the national 
population; presumably they were concerned that the sample 
not be biased with respect to demographic characteristics. 
Thus far, however, the discussion of annual norms has not 
confronted the issue of what it would mean to adopt a 
conception of a norming population where everyone is teaching 
to the same test. Consider what it would mean to try to 
interpret relative standing in a population of users. In any 
normative comparison, a district is at a disadvantage if it 
plays fair by teaching to a broad curricular domain and 
avoiding more than one-time practice on test format. This 
disadvantage would be exaggerated, however, in a comparison 
among users rather than when each is compared to the naive 
norming sample. There is no way to assure that users in the 
norm group will teach to the test to the same degree, nor 
that all will avoid unethical practices. Sources of invalid 
gains will then be built into the norm. The standard of 
comparison based on user norms would be spurious and 
inflationary. 

Conclusion 

In this study we have been concerned primarily with what 
test-curriculum alignment and teaching the test might do to 
the meaning of scores. There is ample evidence here and 
elsewhere, however, that these practices harm instruction and 
learning as well. For example, Darling-Hammond and Wise 
(1985) found that teachers abandon the use of essay tests 
because they are inefficient in preparing students for 
multiple-choice tests. In the early childhood field the 
rising number of kindergarten retentions is associated not 
just with direct kindergarten promotion tests, as in the 
Georgia example, but with concerns about protecting the 
school's performance on standardized tests as remote as third 
grade (National Association for the Education of Young 
Children, 1988; National Association of State Boards of 
Education, 1988; Shepard & Smith, 1988). If high-stakes 
pressure is already distorting instruction, what will happen 
if schools are to be evaluated in comparison to an inflated 
and escalating norm? 

An obvious alternative, suggested by two of the original 
respondents to Cannell, is to develop new tests every year. 
Publishers could consider using the same equating procedures 
that allow several versions of the SAT and ACT to be used 
every year. Such proposals are apparently rejected out of 
hand because the costs are thought to be prohibitive. High 
school students pay $11.50 each to take the ACT. Counting 
the amortized cost of the initial purchase of booklets, 
districts pay $3.50 or more per pupil per year for off-the- 
shelf standardized tests. Obviously, states and districts 
would not be willing to maintain their current programs at 
three times the cost. But the more that the integrity of 
scores becomes an issue, the more that they might be willing 
to consider testing one-third as many grade levels or 
different subjects every year. Other solutions include 
sampling procedures (pupil sampling or matrix sampling) that 
reduce both the incentive and the means to teach the test, and 
state-developed tests, such as writing assessments and student 
portfolios, that make a greater effort to capture the full 
extent of learning domains. States that 
have not been able to afford to develop their own tests might 
consider joining consortia to create tests both with more 
expansive content and with procedural safeguards such as 
multiple forms to prevent teaching to test items. 
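
The cost comparison above can be checked with a short calculation. This is a sketch that takes the two per-pupil figures quoted in the text and assumes, as the argument implies, that a test renewed every year would cost a district about what the ACT costs a student. The function names are illustrative only.

```python
# Per-pupil figures quoted in the text.
ACT_FEE_PER_STUDENT = 11.50           # fee for a test with new forms each year
OFF_THE_SHELF_COST_PER_PUPIL = 3.50   # current cost of reused standardized tests


def cost_ratio(new_cost, current_cost):
    """How many times more expensive the new test is per pupil."""
    return new_cost / current_cost


def affordable_fraction(new_cost, current_cost):
    """Share of current testing volume a fixed budget buys at the new price."""
    return current_cost / new_cost


ratio = cost_ratio(ACT_FEE_PER_STUDENT, OFF_THE_SHELF_COST_PER_PUPIL)
fraction = affordable_fraction(ACT_FEE_PER_STUDENT, OFF_THE_SHELF_COST_PER_PUPIL)

print(f"Cost ratio: {ratio:.2f}")
print(f"Affordable share of current testing: {fraction:.2f}")
```

On these figures the ratio is a little over three, which is consistent with the suggestion that a fixed budget would cover about one-third as many grade levels.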
Table 1 

Explanations for Spuriously* High Achievement Scores from Responses to the Cannell Report 

Phillips & Finn 
U.S. Dept. of Educ. 

Drahozal & Frisbie 
Riverside Publishing 

Lenke & Keene 
Psychological Corp. 

Williams 
CTB/McGraw-Hill 

Qualls-Payne 
SRA 

Stonehill 
U.S. Dept. of Educ. 



1) Non-representativeness 
of national norms; 
overrepresentation of 
test users leads to 
spuriously high 
norms. 

Non-representativeness 
of national norms; 
underrepresentation of 
high scoring districts 
[illegible] norms. 



2) Curriculum alignment 
in test selection gives 
an advantage over 
norming condition. 

Users outperform non-users 
in sample. 

Curriculum alignment 
will lead to overestimate 
of pupil standing 
(see 5). 

Test users selecting 
test matched to 
curriculum have an 
advantage. 

Non-representative norm 
would be "too easy" if they 
rely too much on compensatory 
education students. 



3) High stakes pressure 
creates more motivation 
than in norming 
condition. 

Job retention & salary 
increases tied to scores 
(see 5). 



4) Outdated norms. 

Solution = annual 
user norms if 
representative. 

Recency of norms. 
1970s norms are 
"softer" than 1980s 
norms. 

Planning for annual 
norms. 

Achievement is going 
up. But changing norms 
too often would create 
a "moving target". 

Norming cycles are 
well known; more 
above median scores 
are valid measures 
of rising achievement. 

Annual norms for 
users. 

User-based norms to 
monitor achievement 
trends and signal need 
for renorming. 



5) Teaching the test 
(rather than the 
test objectives). 

Solution = fresh 
tests more often. 

Teaching the test. 
Practice and narrowing 
of the curriculum. 

Solution = test security. 

New test each year would 
reduce need for annual 
norms. 



6) Non-comparability 
of samples; more 
holding out of low 
scoring students in 
user sample than in 
norming sample. 

Difference in 
tested & total enrolled 
population. 

Handicapped & limited 
English speakers [illegible] 
not be excluded by 
[illegible] guidelines. 

Adequacy of expectations 
based on socioeconomic 
factors & expenditures. 



8) 

NRTs originally intended to evaluate pupil scores. 

Accuracy of comparisons, 
pupil vs. district 
averages. 

Interpreting group performance 
relative to 
national pupil norms. 



* Some authors' responses to Cannell disputed his facts and statistics or argued that the test 
score gains are real rather than spurious. Such responses are not 
included in Table 1, which summarizes only the explanations given for spuriously high scores. 
Table 2 (DRAFT)* 



State           Test                Same Test  Since        Given by Teacher  Notes

Alabama         SAT/OLSAT           Yes        1984         Yes               Use alternate form for teacher review.
[Alaska]        Various             Yes                     Yes               Local [illegible] choice of tests; variable number of years in use.
Arizona         [illegible]         Yes        [illegible]  Yes
Arkansas        State test          Yes        198[?]       Yes
                MAT-6               Yes        1985-86      Yes               Becoming familiar; we just changed from SRA.
California      State test          Yes*       Variable**   Yes               *[illegible] forms reused. **Grade [?], 19[?]4; grade 1[?], 198[?] (1976).
Colorado        [illegible]         Yes        [illegible]
Connecticut     [illegible]                                 Yes               [illegible]
Delaware        CTBS                Yes        1986         Yes
Florida         State test                                                    Multiple forms; different grades, different subjects each year.
Georgia         ITBS/TAP            Yes        1986         Yes
[Hawaii]        SAT                 Yes        1982         Yes
Idaho           ITBS/TAP            Yes        1986-8[?]    Yes               Grade 6 & 8, 198[?]; grade 11, 198[?].
Illinois        [illegible]                                 Yes               [illegible]
Indiana         [illegible] CAT*                            Yes               [illegible] Test of cognitive skills will be the same each year.
Iowa            ITBS                Yes        1985-86      Yes
Kansas          State test
Kentucky        Customized          Yes        198[?]       Yes
Louisiana       CAT                 Yes        1987-88      Yes               Switched tests this year.
Maine           State test                                  Yes               Matrix sampling; 33% turnover of items per year.
Maryland        CAT                 Yes        1981-82      Yes
Massachusetts   State test                                  Yes               3,600 items; replace 20-30%.
Michigan        State test          Yes        1980         Yes
Minnesota       State test                                                    Planning to change test items.
Mississippi     SAT                 Yes        1991 [sic]   Yes               Not administered every year (?).
Missouri        State test                     1987-88      Yes               New forms with routine [illegible].
Nevada          SAT                 Yes        1984-85      Yes               Grade 9, 1987.
New Hampshire   CAT                 Yes        1985         Yes               Early fall testing means teachers are not identifying with results.
New Jersey      State MC test                               Yes               Old versions are used for remediation.
New Mexico      CTBS                Yes        [illegible]  Yes
New York        State test                                  Yes               Elementary test new every [?] years. High school new every year.
No. Carolina    CAT                 Yes        1986         Yes
No. Dakota      SRA/ITBS            Yes                     Yes               Compilation of local norm-referenced tests.
Oklahoma        MAT-6               Yes        1986         Yes
Oregon          State test                                  Yes               New each year; sample of schools.
Pennsylvania    State test                                  Yes               Some old and new items each year.
Rhode Island    MAT-6               Yes        1986         Yes
So. Carolina    CTBS                Yes        198[?]       Yes
South Dakota    SAT/TASK            Yes        1985         Yes
Tennessee       SAT/TASK            Yes        1985         Yes
Texas           State test                                  Yes               As much as 50% same items.
Utah            CTBS                Yes        1984
Virginia        SRA/ITBS            Yes        1988         Yes
Washington      MAT                 Yes        1985         Yes
West Virginia   CTBS                Yes        1984-85      Yes
Wisconsin       CTBS                Yes        1982         Yes               Phasing out after 1988.
Wyoming         Concurrent NAEP


* Follow-up phone calls are scheduled to confirm some of the [illegible] 
to Table 2. Please report errors to the author. 